Chemical similarity using PubChem fingerprints



In [3]:

    
import pubchempy as pcp
from IPython.display import Image

First we'll get some compounds. Here we retrieve them directly by PubChem CID, but you could also search for them (e.g. by name, SMILES, SDF, etc.).



In [5]:

    
coumarin = pcp.Compound.from_cid(323)
Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=323&t=l')









    Out[5]:



In [7]:

    
coumarin_314 = pcp.Compound.from_cid(72653)
Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=72653&t=l')









    Out[7]:



In [8]:

    
coumarin_343 = pcp.Compound.from_cid(108770)
Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=108770&t=l')









    Out[8]:



In [10]:

    
aspirin = pcp.Compound.from_cid(2244)
Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=2244&t=l')









    Out[10]:

The similarity between two molecules is typically calculated using molecular fingerprints that encode structural information about the molecule as a series of bits (0 or 1). These bits represent the presence or absence of particular patterns or substructures — two molecules that contain more of the same patterns will have more bits in common, indicating that they are more similar.
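To make the idea concrete, here is a toy sketch using a made-up four-pattern key. The pattern names, their order, and the hand-assigned pattern sets are illustrative assumptions, not the actual PubChem substructure keys:

```python
# Hypothetical four-pattern key; the real PubChem key has far more patterns.
PATTERNS = ['aromatic ring', 'carbonyl', 'ester', 'carboxylic acid']

def toy_fingerprint(present):
    """Return a bit string with a '1' for each pattern the molecule contains."""
    return ''.join('1' if p in present else '0' for p in PATTERNS)

# Hand-assigned pattern sets, for illustration only.
coumarin_fp = toy_fingerprint({'aromatic ring', 'carbonyl', 'ester'})
aspirin_fp = toy_fingerprint({'aromatic ring', 'carbonyl', 'ester', 'carboxylic acid'})

# Count the positions where both fingerprints have a 1-bit.
shared = sum(a == b == '1' for a, b in zip(coumarin_fp, aspirin_fp))

print(coumarin_fp)  # 1110
print(aspirin_fp)   # 1111
print(shared)       # 3
```

The more patterns two molecules share, the more 1-bits line up in the same positions.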

The PubChem CACTVS fingerprint is available on each compound through the fingerprint property. It is returned as a hex-encoded string:



In [12]:

    
coumarin.fingerprint









    Out[12]:





u'0000037180703000000000000000000000000000000000000000304000000000000000810000001A00000000000C04809800300E80000400880220D208000208002020000888000608C80C262284311A823A20A4C01108A98780C0200E00000000000800000000000000100000000000000000'

We can decode this from hexadecimal and display it as a binary string as follows:



In [14]:

    
bin(int(coumarin.fingerprint, 16))









    Out[14]:





'0b1101110001100000000111000000110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011000001000000000000000000000000000000000000000000000000000000000000001000000100000000000000000000000000011010000000000000000000000000000000000000000000001100000001001000000010011000000000000011000000001110100000000000000000000100000000001000100000000010001000001101001000001000000000000000001000001000000000000010000000100000000000000000100010001000000000000000011000001000110010000000110000100110001000101000010000110001000110101000001000111010001000001010010011000000000100010000100010101001100001111000000011000000001000000000111000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000'
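Note that bin drops leading zero bits, so the string above begins at the first 1-bit rather than at the first bit of the fingerprint. If you want every hex digit to map to exactly four bits, one option is to zero-pad the result. The sketch below uses only the first 14 hex characters of the coumarin fingerprint to keep the output short, and the helper name hex_to_bits is made up for this example:

```python
def hex_to_bits(hex_string):
    """Decode a hex string into a bit string, keeping leading zeros."""
    width = len(hex_string) * 4  # four bits per hex digit
    return format(int(hex_string, 16), '0{}b'.format(width))

# First 14 hex characters of the coumarin fingerprint shown above.
hex_start = '00000371807030'
bits = hex_to_bits(hex_start)

print(len(bits))               # 56
print(bits[:32])               # 00000000000000000000001101110001
print(int(hex_start[:8], 16))  # 881
```

The first eight hex characters decode to 881. Reading them as a length field rather than as pattern bits is an assumption based on the PubChem fingerprint specification, so check it against that document before relying on it.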

There is more information about the PubChem fingerprints at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt

The most commonly used measure for quantifying the similarity of two fingerprints is the Tanimoto Coefficient, given by:

$$ T = \frac{N_{ab}}{N_{a} + N_{b} - N_{ab}} $$

where $N_{a}$ and $N_{b}$ are the number of 1-bits (i.e. bits corresponding to the presence of a pattern) in the fingerprints of molecule $a$ and molecule $b$ respectively, and $N_{ab}$ is the number of 1-bits common to the fingerprints of both molecules. The Tanimoto coefficient ranges from 0, when the fingerprints have no bits in common, to 1, when the fingerprints are identical.
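As a small worked example of the formula, the sketch below represents each fingerprint as a set of 1-bit positions. The positions are made-up values, not real fingerprints:

```python
# Hypothetical 1-bit positions for two molecules.
bits_a = {0, 2, 3, 5}
bits_b = {2, 3, 7}

n_a = len(bits_a)            # N_a = 4
n_b = len(bits_b)            # N_b = 3
n_ab = len(bits_a & bits_b)  # N_ab = 2 (positions 2 and 3)

t = n_ab / float(n_a + n_b - n_ab)
print(t)  # 0.4
```

The denominator counts every position that is set in at least one fingerprint, so the coefficient is the fraction of those positions that are set in both.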

Here's a simple way to calculate the Tanimoto coefficient between two compounds in Python:



In [16]:

    
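def tanimoto(compound1, compound2):
    """Return the Tanimoto coefficient of two compounds' PubChem fingerprints."""
    # Decode the hex-encoded fingerprints into integers so we can use bitwise operations
    fp1 = int(compound1.fingerprint, 16)
    fp2 = int(compound2.fingerprint, 16)
    # Count the 1-bits in each fingerprint, and the 1-bits set in both
    fp1_count = bin(fp1).count('1')
    fp2_count = bin(fp2).count('1')
    both_count = bin(fp1 & fp2).count('1')
    # float() forces true division under Python 2
    return float(both_count) / (fp1_count + fp2_count - both_count)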

Let's try it out:



In [17]:

    
tanimoto(coumarin, coumarin)









    Out[17]:





1.0



In [18]:

    
tanimoto(coumarin, coumarin_314)









    Out[18]:





0.6011904761904762



In [19]:

    
tanimoto(coumarin, coumarin_343)









    Out[19]:





0.6011904761904762



In [20]:

    
tanimoto(coumarin_314, coumarin_343)









    Out[20]:





0.9529411764705882



In [21]:

    
tanimoto(coumarin, aspirin)









    Out[21]:





0.8211382113821138



In [22]:

    
tanimoto(coumarin_343, aspirin)









    Out[22]:





0.6123595505617978
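One caveat, offered as an assumption to verify rather than a correction: according to the PubChem fingerprint specification, the raw string starts with a 4-byte length field (the 00000371 at the start of the hex string shown earlier is 881 in decimal) and ends with 7 padding bits. The tanimoto function above counts the length-field bits along with the pattern bits, which may nudge the scores slightly upward. Here is a sketch that strips them first, using a hypothetical helper name and toy hex strings that only mimic this layout:

```python
def tanimoto_hex(hex1, hex2, prefix_chars=8, padding_bits=7):
    """Tanimoto coefficient of two hex fingerprints, ignoring prefix and padding.

    prefix_chars and padding_bits are assumptions taken from the PubChem
    fingerprint specification; adjust them if your data is laid out differently.
    """
    # Drop the leading length field, then shift out the trailing padding bits.
    fp1 = int(hex1[prefix_chars:], 16) >> padding_bits
    fp2 = int(hex2[prefix_chars:], 16) >> padding_bits
    both = bin(fp1 & fp2).count('1')    # bits set in both fingerprints
    either = bin(fp1 | fp2).count('1')  # bits set in at least one fingerprint
    return float(both) / either

# Toy inputs: an 8-character prefix followed by a short made-up body.
toy_a = '00000371' + 'FF80'
toy_b = '00000371' + 'F080'

print(tanimoto_hex(toy_a, toy_b))  # 5 shared bits out of 9 set in either
```

With real data you would pass the fingerprint string of each compound. The function raises ZeroDivisionError if neither fingerprint has any bits set, so add a guard if that can happen in your data.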

This is a nice simple method, but not particularly efficient. If you are looking for better performance, check out Andrew Dalke's work:


